tidycensusIn this tutorial, we will work through several exercises using the tidycensus R package to fetch, wrangle, and map census data.
The key tidycensus functions we will use today are:
census_api_key: makes your Census API key available
to tidycensus
load_variables: retrieves a dataframe of available
census data variables
get_decennial: fetch census data from a recent
decennial censuses - 2000, 2010 (and soon 2020)
get_acs: fetch census data from an ACS (American
Community Survey) 1 or 5 year dataset, 2005 - 2020.
Be sure to clone or downloaded and unzip the workshop files from: https://github.com/dlab-berkeley/Census-Data-in-R
Then:
Open the folder with the workshop files
Double-click on the R Project file
Census-Data-in-R.Rproj
This should open RStudio - with the Files panel
displaying the workshop folder contents.
Double-click on the file Census-Data-in-R.Rmd to
follow along!
You can also click on the file Census-Data-in-R.html
in the Files tab to open the workshop tutorial in a web brower.
If you installed any of these packages awhile ago, (especially
tidycensus), it’s a good idea to install updates when you can (though not during the workshop as things can break!).
# Uncomment this to install packages, if necessary.
# install.packages(c("here", "tidyverse", "sf", "leaflet", "mapview", "tigris", "tidycensus"))
library(here)
## here() starts at /Users/pattyf/Documents/Dlab/workshops/AY2022/Census-Data-in-R-Sp22
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.1 ✓ dplyr 1.0.5
## ✓ tidyr 1.1.3 ✓ stringr 1.4.0
## ✓ readr 2.1.0 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(sf)
## Warning: package 'sf' was built under R version 4.0.5
## Linking to GEOS 3.9.1, GDAL 3.4.0, PROJ 8.1.1; sf_use_s2() is TRUE
library(leaflet)
## Warning: package 'leaflet' was built under R version 4.0.5
library(mapview)
## Warning: multiple methods tables found for 'crop'
## Warning: multiple methods tables found for 'extend'
library(tigris)
## Warning: package 'tigris' was built under R version 4.0.5
## To enable caching of data, set `options(tigris_use_cache = TRUE)`
## in your R script or .Rprofile.
library(tidycensus)
## Warning: package 'tidycensus' was built under R version 4.0.5
##
## Attaching package: 'tidycensus'
## The following object is masked from 'package:tigris':
##
## fips_codes
These seven libraries should be loaded in your environment now.
# If you run this chunk, output from the "here" function should be visible below. This is your local directory path. We can use this to import files later on.
here()
## [1] "/Users/pattyf/Documents/Dlab/workshops/AY2022/Census-Data-in-R-Sp22"
You need a Census API key to programmatically fetch
census data.
Get it here (pretty quickly): https://api.census.gov/data/key_signup.html
The key will be sent to your email and you will need to click to activate it.
Keep the email with the key open for use in this workshop.
For more info on all available Census APIs see: https://www.census.gov/data/developers/data-sets.html
To use your Census API Key in R
Copy and paste your Census API key from your email
Use the tidycensus function census_api_key to register your API key with tidycensus. Don’t forget to put quotes around the key!.
# Install your census api key - long alphanumeric string
census_api_key("THE_BIG_LONG_ALPHANUMERIC_API_KEY_YOU_GOT_FROM_CENSUS")
Another way to add your Census API Key:
I keep my key in a file so no one can see it. One way to do this is
by making a script that creates a variable key, and then using the
source function to add that script as an object into your
coding environment. The code chunk below is an example of how you might
do that:
# source (run) an r script that creates a variable with my key
#source("/Users/pattyf/Documents/Dlab/workshops/keys/census_api_key.R")
#print(my_census_api_key)
# register the key
census_api_key(key = my_census_api_key)
get_decennial functionWe start by fetching total population from the 2010
Census with tidycensus’s get_decennial function. Let’s
first talk about the code.
pop2010 <- get_decennial(geography = "state", # census tabulation unit
variables = "P001001", # variable(s) of interest
year = 2010) # census year
## Getting data from the 2010 decennial Census
## Using Census Summary File 1
head(pop2010)
## # A tibble: 6 x 4
## GEOID NAME variable value
## <chr> <chr> <chr> <dbl>
## 1 01 Alabama P001001 4779736
## 2 02 Alaska P001001 710231
## 3 04 Arizona P001001 6392017
## 4 05 Arkansas P001001 2915918
## 5 06 California P001001 37253956
## 6 22 Louisiana P001001 4533372
We can pass a vector of census identifiers to the
variables function argument if we want to get data for more
than one variable. Below we add P0002002 for population in
urban areas.
pop2010 <- get_decennial(geography="state",
variables = c("P001001","P002002"), # variable(s) of interest
year = 2010) # census year
## Getting data from the 2010 decennial Census
## Using Census Summary File 1
# take a look
head(pop2010)
## # A tibble: 6 x 4
## GEOID NAME variable value
## <chr> <chr> <chr> <dbl>
## 1 01 Alabama P001001 4779736
## 2 02 Alaska P001001 710231
## 3 04 Arizona P001001 6392017
## 4 05 Arkansas P001001 2915918
## 5 06 California P001001 37253956
## 6 22 Louisiana P001001 4533372
We can see the data for both variables if we sort the output by county name.
# Sort dataframe by county names (the NAME column)
pop2010 %>% arrange(NAME) %>% head()
## # A tibble: 6 x 4
## GEOID NAME variable value
## <chr> <chr> <chr> <dbl>
## 1 01 Alabama P001001 4779736
## 2 01 Alabama P002002 2821804
## 3 02 Alaska P001001 710231
## 4 02 Alaska P002002 468893
## 5 04 Arizona P001001 6392017
## 6 04 Arizona P002002 5740659
tidy dataBy default, tidycensus returns data in a tidy, or
long format that allows data for multiple variables to be
contained within the variable and value
columns.
This is in contrast to untidy, or wide data where each
variable is in its own column.
tidycensus can return wide data if you can add the
parameter output=wide to the function call.
# wide format
pop2010w <- get_decennial(geography = "state", # census tabulation unit
variables = c("P001001","P002002"), # variable(s) of interest
year = 2010, # census year
output="wide") # get output in wide format
## Getting data from the 2010 decennial Census
## Using Census Summary File 1
head(pop2010w)
## # A tibble: 6 x 4
## GEOID NAME P001001 P002002
## <chr> <chr> <dbl> <dbl>
## 1 01 Alabama 4779736 2821804
## 2 02 Alaska 710231 468893
## 3 04 Arizona 6392017 5740659
## 4 05 Arkansas 2915918 1637589
## 5 06 California 37253956 35373606
## 6 22 Louisiana 4533372 3317805
GEOID columnThe GEOID column is included in tidycensus output by
default.
This is a Census geographic identifier for the tabulation unit.
The GEOID is sometimes called the Census
FIPS code and for most tabulation units these are the
same.
The GEOID makes it possible to link to Census
demographic data to Census geographic data and make maps. We will do
this in a bit.
The GEOID is a text string and must be quoted.
Question: What is the GEOID for California?
Public census data is typically aggregated by census geographies to protect privacy.
These census geographies are called
Census tabulation units.
Some of the most common geographic tabulation units and their tidycensus function abbreviations are shown below, along with required and available filters that limit what data are returned.
| Geography | Definition | Filter(s) | Used in tidycensus |
|---|---|---|---|
| “us” | United States | get_acs(), get_decennial() | |
| “region” | Census region | get_acs(), get_decennial() | |
| “state” | State or equivalent | state | get_acs(), get_decennial() |
| “county” | County or equivalent | state, county | get_acs(), get_decennial() |
| “place” | Census place | state | get_acs(), get_decennial() |
| “tract” | Census tract | state, county | get_acs(), get_decennial() |
| “block group” | Census block group | state, county | get_acs(), get_decennial() |
| “block” | Census block | state, county | get_decennial() only! |
get_decennial Tabulation Units and FiltersLet’s take a few minutes to practice fetching population data with
the get_decennial function.
?get_decennial for helpOpen Census-Data-in-R-Challenges.Rmd and use the
get_decennialfunction like we’ve seen above to fetch population data. Solutions are available in the Solutions folder, as needed.
Let’s fetch 2010 population data for CA counties
What changes in the code?
get_decennial(geography = "county", # census tabulation unit
variables = "P001001", # variable(s) of interest
year = 2010, # census year
state='CA') # Filter by state is CA
## Getting data from the 2010 decennial Census
## Using Census Summary File 1
## # A tibble: 58 x 4
## GEOID NAME variable value
## <chr> <chr> <chr> <dbl>
## 1 06011 Colusa County, California P001001 21419
## 2 06007 Butte County, California P001001 220000
## 3 06001 Alameda County, California P001001 1510271
## 4 06003 Alpine County, California P001001 1175
## 5 06005 Amador County, California P001001 38091
## 6 06009 Calaveras County, California P001001 45578
## 7 06013 Contra Costa County, California P001001 1049025
## 8 06015 Del Norte County, California P001001 28610
## 9 06031 Kings County, California P001001 152982
## 10 06021 Glenn County, California P001001 28122
## # … with 48 more rows
state= filter?You can also filter tidycensus results by county
get_decennial(geography = "county", # census tabulation unit
variables = "P001001", # variable(s) of interest
year = 2010, # census year
state='CA', # Filter by state is CA
county='Alameda') # Filter by county Alameda
## Getting data from the 2010 decennial Census
## Using Census Summary File 1
## # A tibble: 1 x 4
## GEOID NAME variable value
## <chr> <chr> <chr> <dbl>
## 1 06001 Alameda County, California P001001 1510271
In Census-Data-in-R-Challenges.Rmd, alter the code above to fetch 2010 population for Alameda & San Francisco Counties. Then try Challenge 2B & 2C.
We can visualize data to get a quick overview of the distribution of the values.
It’s a first step in exploratory data analysis and a last step in data communication.
ggplot2 is the most commonly used R package for data
visualization.
tidyverse package.Let’s use it to visualize the population data.
Use ggplot2 to create an ordered horizontal bar
chart.
# create a plot.
pop_plot <- ggplot(data=pop2010,
# set aesthetic variables
aes(x=value/1000000, y=reorder(NAME,value)) ) +
# pick geometry
geom_bar(stat="identity") +
# add theme and titles.
theme_minimal() +
labs(title = "2010 US Population by State") +
xlab("Population (in Millions)") +
ylab("State")
# display the plot.
pop_plot
Developing your ggplot2 knowledge can really enhance
your data analysis skills.
In combination with tidycensus it creates a powerful,
reproducible data science workflow.
In the code above we fetched data for total population in 2010 using
the variable "P001001".
That is not an obvious variable name, so how do we get those census data identifiers?
We can use the tidycensus load_variables function for
this.
load_variables functionFor any census dataset like the decennial census or the ACS 1 or
5-year estimates, use the load_variables function to fetch
all available variables and identifiers.
Since these datasets have many, many variables, save the resulting dataframe to a variable and cache it locally so you do not need to repeatedly retrieve it over the web.
vars2010 <- load_variables(year=2010, # Year or end year for ACS-5yr
dataset = 'sf1', # 'sf1' for decennial census
cache = TRUE) # Save fetched data locally
# How large is the output
dim(vars2010)
## [1] 8959 3
# Take a look with head or View
head(vars2010)
## # A tibble: 6 x 3
## name label concept
## <chr> <chr> <chr>
## 1 H001001 Total HOUSING UNITS
## 2 H002001 Total URBAN AND RURAL
## 3 H002002 Total!!Urban URBAN AND RURAL
## 4 H002003 Total!!Urban!!Inside urbanized areas URBAN AND RURAL
## 5 H002004 Total!!Urban!!Inside urban clusters URBAN AND RURAL
## 6 H002005 Total!!Rural URBAN AND RURAL
Over 3,000 unique variables that describe population and housing characteristics
Organized in 333 Tables
See: https://www.census.gov/data/datasets/2010/dec/summary-file-1.html
We know this from our previous code blocks, but let’s find it for practice navigating the dataframe.
vars2010 dataframe to find
it.What 2010 decennial census variable contains…
Median Age
Average Family Size
Number of occupied housing units
*See Census-Data-in-R-Solutions.Rmd if needed (under Variable Questions)
Return to Census-Data-in-R-Challenges.Rmd and use the
get_decennialfunction to fetch and plot anAvg Family Sizevaraible by CA County in2010, and name the call as a dataframe,ca_fam_size. Once you’ve done that, plot the dataframe with theggplotcall below. Hint: “P037001”
Repeat the previous challenge with data from the
2000decennial census. Don’t assume variable names are the same across the 2000 and 2010 census
Use
load_variablesto check the variable name!
Census tracts are the most commonly used census tabulation unit.
Let’s fetch population data for the census tabulation unit to tract
Because of the large number of census tracts, you MUST specify a state when requesting these data with tidycensus.
## Fetch population by **tract** for California.
ca_tract_pop2010 <- get_decennial(geography = "tract", # census tab unit
variables = "P001001", # var of interest
year = 2010, # census year
state='CA') # State filter
## Getting data from the 2010 decennial Census
## Using Census Summary File 1
# How many tracts in CA
dim(ca_tract_pop2010)
## [1] 8057 4
# take a look
head(ca_tract_pop2010)
## # A tibble: 6 x 4
## GEOID NAME variable value
## <chr> <chr> <chr> <dbl>
## 1 06037434004 Census Tract 4340.04, Los Angeles County, Californ… P001001 2796
## 2 06037460000 Census Tract 4600, Los Angeles County, California P001001 4851
## 3 06037460200 Census Tract 4602, Los Angeles County, California P001001 5315
## 4 06037460301 Census Tract 4603.01, Los Angeles County, Californ… P001001 4638
## 5 06037460302 Census Tract 4603.02, Los Angeles County, Californ… P001001 4442
## 6 06037460401 Census Tract 4604.01, Los Angeles County, Californ… P001001 878
Census tract data can be quite large!
Fortunately, you can also limit the results to one or more counties.
tract_pop2010 <- get_decennial(geography = "tract", # census tabulation unit
variables = "P001001", # variable of interest
year = 2010, # census year - only one!
state="CA", # limit to California
county=c("Alameda","Contra Costa")) # & counties
## Getting data from the 2010 decennial Census
## Using Census Summary File 1
dim(tract_pop2010)
## [1] 569 4
What two things are new here?
#urban and rural pop for 3 CA counties
ur_pop10 <- get_decennial(geography = "county", # census tabulation unit
variables = c(urban="P002002",rural="P002005"),
year = 2010,
summary_var = "P002001", # The denominator
state='CA',
county=c("Napa","Sonoma","Mendocino"))
## Getting data from the 2010 decennial Census
## Using Census Summary File 1
We have already specified more than one variable:
variables = c("P002002","P002005")
variables = c(urban="P002002",rural="P002005")
summary_var (a denominator - here,
the total count of all people or households surveyed. Can be used for
calculations like percent of total.)summary_var = "P002001"
# take a look at the results
ur_pop10
## # A tibble: 6 x 5
## GEOID NAME variable value summary_value
## <chr> <chr> <chr> <dbl> <dbl>
## 1 06045 Mendocino County, California urban 48110 87841
## 2 06055 Napa County, California urban 118194 136484
## 3 06097 Sonoma County, California urban 424102 483878
## 4 06045 Mendocino County, California rural 39731 87841
## 5 06055 Napa County, California rural 18290 136484
## 6 06097 Sonoma County, California rural 59776 483878
The summary_value column comes in handy when you want to
compute percent of total, for example:
# Calculate the percent of population that is Urban or Rural
ur_pop10 <- ur_pop10 %>%
mutate(pct = 100 * (value / summary_value))
# Take a look at the output.
ur_pop10
## # A tibble: 6 x 6
## GEOID NAME variable value summary_value pct
## <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 06045 Mendocino County, California urban 48110 87841 54.8
## 2 06055 Napa County, California urban 118194 136484 86.6
## 3 06097 Sonoma County, California urban 424102 483878 87.6
## 4 06045 Mendocino County, California rural 39731 87841 45.2
## 5 06055 Napa County, California rural 18290 136484 13.4
## 6 06097 Sonoma County, California rural 59776 483878 12.4
A plot gives us compact visual summaries of the data.
## Plot it with ggplot2
myplot <- ggplot(data = ur_pop10,
mapping = aes(x = NAME, fill = variable,
y = ifelse(test = variable == "urban",
yes = -pct, no = pct))) +
geom_bar(stat = "identity") +
scale_y_continuous(labels = abs, limits=c(-100,100)) +
labs(title="Urban & Rural Population in Wine Country",
x="County", y = " Percent of Population", fill="") +
coord_flip()
myplot
Don’t worry if you don’t get all the ggplot code now. It’s here for reference.
You can use your R skills to reformat the data and make it more usable.
Let’s fetch population data for 2010 and 2000 by state.
Then we will combine these into one data frame using the
tidyverses::bind_rows function
# Fetch 2000 population data by state
pop2000 <- get_decennial(geography = "state",
variables = c(pop2000="P001001"),
year = 2000)
## Getting data from the 2000 decennial Census
## Using Census Summary File 1
# Fetch 2010 population data by state
pop2010 <- get_decennial(geography = "state",
variables = c(pop2010="P001001"),
year = 2010)
## Getting data from the 2010 decennial Census
## Using Census Summary File 1
# Use tidyverse `bind_rows` function to combine the data for these years
state_pop <- bind_rows(pop2000, pop2010)
# Take a look with head or View
state_pop %>% arrange(NAME) %>% head(10)
## # A tibble: 10 x 4
## GEOID NAME variable value
## <chr> <chr> <chr> <dbl>
## 1 01 Alabama pop2000 4447100
## 2 01 Alabama pop2010 4779736
## 3 02 Alaska pop2000 626932
## 4 02 Alaska pop2010 710231
## 5 04 Arizona pop2000 5130632
## 6 04 Arizona pop2010 6392017
## 7 05 Arkansas pop2000 2673400
## 8 05 Arkansas pop2010 2915918
## 9 06 California pop2000 33871648
## 10 06 California pop2010 37253956
tidycensus outputThe data we fetch using tidycensus is stored in an R
dataframe.
We can use the write.csv or write_csv
function to save the contents of a dataframe to a CSV file.
write_csv(state_pop, here('data_out/state_pop_2010.csv') )
Any Questions?
You can fetch census geographic data by adding the parameter
geometry=TRUE to tidycensus functions
tigris package to
fetch data from the Census Geographic Data APIs.You can then use your favorite R mapping packages like
sf, ggplot, tmap,
mapview and leaflet to make maps.
Before fetching census geographic data, we need to set the option
tigris_use_cache to TRUE
Caching saves data locally. This greatly speeds things up if you fetch the same census geographic data repeatedly.
# Tigris options - used by tidycensus
# Cache retrieved geographic data locally
options(tigris_use_cache = TRUE)
tidycensusWe fetch the census geographic data by setting geometry=TRUE.
pop2010geo <- get_decennial(geography = "state",
variables = c(pop10="P001001"),
year = 2010,
output="wide",
geometry=TRUE) # Fetch geometry data for mapping
## Getting data from the 2010 decennial Census
## Using Census Summary File 1
Let’s take a minute to discuss the format of an sf
spatial object.
head(pop2010geo, 3)
## Simple feature collection with 3 features and 3 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -90.41814 ymin: 41.23796 xmax: -66.9499 ymax: 48.19097
## Geodetic CRS: NAD83
## # A tibble: 3 x 4
## GEOID NAME pop10 geometry
## <chr> <chr> <dbl> <MULTIPOLYGON [°]>
## 1 23 Maine 1.33e6 (((-67.61976 44.51975, -67.61541 44.52197, -67.58774 …
## 2 25 Massachus… 6.55e6 (((-70.83204 41.6065, -70.82373 41.59857, -70.82092 4…
## 3 26 Michigan 9.88e6 (((-88.68443 48.11578, -88.67563 48.12044, -88.67639 …
The tidycensus package uses the R sf
package to manage geospatial data.
R sf objects include:
a dataframe with a geometry column labeled
geometry
a CRS (coordinate reference system), specified
by
For a deeper understanding of the sf package and its
functionality, we recommend
our R-Geospatial-Fundamentals workshop
All geospatial data are referenced to the surface of the earth with a
CRS, or coordinate reference system. Anyone
working with geospatial data will need to develop an understanding of
CRSs.
Fortunately, many of us are familiar with longitude and latitude, which are geographic coordinates. But there are different versions of geographic CRSs. And there are also projected CRSs which transform longitude and latitude to 2 dimensional surface for mapping & analysis.
All census geographic data use the NAD83 geographic CRS.
NAD83 stands for North American Datum of 1983. This CRS (or
version of latitude and longitude) is best for locations in North
America.
Many geospatial operations require you transform data to a common CRS before conducting spatial analysis or mapping.
An in-depth discussion of CRSs is outside the scope of this workshop. See Geocomputation in R for more information.
sf Spatial ObjectsWe can use sf::plot to make a quick map the geometry
stored in an sf spatial object.
# plot the geometry column data
plot(pop2010geo$geometry)
The vast geographic extent and non-contiguous nature of the USA makes it difficult to map.
Fortunately, tidycensus includes a shift_geo parameter
to shift AK & HI to below Texas.
pop2010geo_shifted <- get_decennial(geography = "state",
variables = c(pop10="P001001"),
output="wide",
year = 2010,
geometry=TRUE,
shift_geo=TRUE)
## Warning: The `shift_geo` argument is deprecated and will be removed in a future
## release. We recommend using `tigris::shift_geometry()` instead.
## Getting data from the 2010 decennial Census
## Using feature geometry obtained from the albersusa package
## Using Census Summary File 1
## Please note: Alaska and Hawaii are being shifted and are not to scale.
## old-style crs object detected; please recreate object with a recent sf::st_crs()
## Shift Happens!
plot(pop2010geo_shifted$geometry)
You can save any sf spatial data object to an ESRI shapefile using
st_write
st_write(pop2010geo_shifted, here("data_out/usa_pop2010_shifted.shp"))
Now take a look at the output shapefile.
# Check to see if the data was written out to a shapefile
dir(here("data_out"))
You can see from this output that an ESRI shapefile is actually a collection of files that all have the same prefix.
Shapefiles are the most common file format for geospatial data. So it’s worthwhile to learn more about them if you will be working with census geographic data.
You can use the sf plot command to make a map that sets
the color of the geometry by the data values
This type of map is called a thematic map.
When the features being plotted are areas (or polygons), it’s
called a choropleth map!
# Name the column with the variable values to make
# a thematic map, also called a choropleth map.
plot(pop2010geo_shifted['pop10'])
ggplot2 Mapggplot knows how to map sf objects!
ggplot(pop2010geo_shifted, aes(fill = pop10)) +
geom_sf() # tells ggplot that geographic data are being plotted
If you are familiar with sf objects and
ggplot you can further customize your maps.
ggplot(pop2010geo_shifted, aes(fill = pop10)) +
geom_sf(color=NA) + # What does color=NA do
coord_sf(crs = 3857) + # Dynamically change the CRS
scale_fill_viridis_c(option = "viridis") # Change the color palette
# Try different options, e.g.
# plasma, magma, inferno, cividis
In your Census-Data-in-R-Challenges.Rmd file, create a map of
Median Age by California County in 2010. Solutions are in the Census-Data-in-R-Solutions.Rmd file
We can fetch Census data and geometry for more than one state or county with same function call.
This is so much easier than any alternative approach!
It can be applied to any available geographic tabulation areas (eg states, counties, tracts, places).
Let’s try it with Census Tracts!
Fetch tract population and geometry data for Bay Area Counties.
bay_counties <- c("Alameda", "Contra Costa", "Marin", "San Francisco",
"Sonoma", "Napa","Solano", "San Mateo", "Santa Clara")
bayarea_pop10 <- get_decennial(geography = "tract",
variables = "P001001",
year = 2010,
state='CA',
county=bay_counties,
geometry=T)
## Getting data from the 2010 decennial Census
## Using Census Summary File 1
# Quick map
plot(bayarea_pop10['value'])
Any Questions?
get_acsACS data contains the most recent information about the American population.
We can use the tidycensus function get_acs to
retrieve ACS data using code very similar to
get_decennial.
BUT the workflow is more complex because:
ACS data has a lot more variables, and
ACS data are sample data, so each ACS variable
that you retrieve with tidycensus will fetch both an
estimate of the value and a margin of
error.
The ACS has two primary data products - the ACS 1 year estimates and the 5 year estimates.
The ACS 1 year estimates are more current but hasve
a larger margin of error and is not available for Census geographies
with a population of < 65,000.
The ACS 5-year estimates are more stable but
represent a larger time period.
The ACS 3 year estimates has been
discontinued.
Let’s use the load_variables function to get a dataframe
of all variables from the ACS 2016—2020 5-year dataset.
acs5 where before
we used sf1 to fetch info on the decennial census
variables.vars_acs2020 <- load_variables(year=2020, # end year 2016-2020 period
dataset = 'acs5', # the ACS data product
cache = T) # Save locally for future use
# how many variables?
dim(vars_acs2020)
## [1] 27850 3
View the vars_acs2020 dataframe to find the variable
name for median household income.
#View(vars_acs2020)
Question Is the variable name for
total populationin the ACS 5 year 2020 data the same as it is in the 2010 census data?
Let’s fetch the median household income data for San Francisco County by census tract.
med_hhincome <- get_acs(geography='tract',
variables="B19013_001",
year = 2020,
state='CA',
county='San Francisco',
geometry=TRUE # get the geography too
)
## Getting data from the 2016-2020 5-year ACS
Take a look at the output
head(med_hhincome)
## Simple feature collection with 6 features and 5 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -122.5099 ymin: 37.70826 xmax: -122.4022 ymax: 37.79016
## Geodetic CRS: NAD83
## GEOID NAME variable
## 1 06075035202 Census Tract 352.02, San Francisco County, California B19013_001
## 2 06075040100 Census Tract 401, San Francisco County, California B19013_001
## 3 06075047702 Census Tract 477.02, San Francisco County, California B19013_001
## 4 06075026403 Census Tract 264.03, San Francisco County, California B19013_001
## 5 06075012100 Census Tract 121, San Francisco County, California B19013_001
## 6 06075020300 Census Tract 203, San Francisco County, California B19013_001
## estimate moe geometry
## 1 99408 19727 MULTIPOLYGON (((-122.5099 3...
## 2 130625 21973 MULTIPOLYGON (((-122.4648 3...
## 3 125156 33460 MULTIPOLYGON (((-122.4887 3...
## 4 72837 18333 MULTIPOLYGON (((-122.4113 3...
## 5 75316 7363 MULTIPOLYGON (((-122.4154 3...
## 6 162059 43630 MULTIPOLYGON (((-122.4352 3...
The census data returned by the get_acs function is a
bit different from that returned by get_decennial.
What is the name of the variable containing the income data?
What is the name if we set output="wide"?
med_hhincome_wide <- get_acs(geography='tract',
variables="B19013_001",
year = 2020,
state='CA',
county='San Francisco',
geometry=TRUE, # get the geography too
output="wide"
)
## Getting data from the 2016-2020 5-year ACS
# uncomment and run to view
# head(med_hhincome_wide)
Use sf::plot to create a map of median household income
in San Francisco.
plot(med_hhincome['estimate'])
What do you think of that map?
It’s odd because San Francisco County is not the same as the city of San Francisco and what we want to map is the city.
We can use ggplot to zoom in on the city by setting the x axis limits to a narrower geographic range.
ggplot(med_hhincome, aes(fill = estimate)) +
geom_sf() +
xlim(-122.55, -122.3)
Why do you think we have NA values in the ACS estimates?
We can drill down into the ACS data by fetching data for subgroups, where available.
Let’s fetch median household income by race.
First identify the variables of interest.
# Median household income by race/ethnicity: Variables from ACS 2015—19
# All households = "B19013_001",
inc_by_race <- c(White = "B19013H_001",
Black = "B19013B_001",
Asian = "B19013D_001",
Hispanic = "B19013I_001" )
Fetch census tract data for multiple variables at once.
# Fetch the Data
alco_mhhinc_by_race <- get_acs(geography='tract',
variables=inc_by_race,
year = 2019,
state='CA',
county='Alameda',
geometry=T )
## Getting data from the 2015-2019 5-year ACS
Facet maps are a way to create visualizations of
small multiples, or subsets of the data in order to
facilitate comparisons. Here, we use ggplot’s facet_wrap
function to make multiple maps of median household income by race for
Alameda County.
# Create the map
medhhinc_facet_map <- alco_mhhinc_by_race %>%
ggplot(aes(fill = estimate)) +
facet_wrap(~variable) +
geom_sf(color=NA) + # why color=NA?
scale_fill_viridis_c(option="magma")
# Display the map
medhhinc_facet_map
In Census-Data-in-R-Challenges.Rmd file, Make a ggplot map of MEDIAN GROSS
RENTin San Francisco County by tract using data from the ACS 2016—2020 5-year product. Check Census-Data-in-R-Solutions.Rmd for answers, as needed.
Any Questions?
Interactive mapping gives the RStudio environment some of the functionality of desktop GIS.
There are a number of R packages that you can use, including:
mapview: quick interactive exploratory data viewing
tmap: great static and interactive maps
Leaflet: highly customizable interactive maps
All of these are based on the
Leaflet Javascript Library.
Let’s use mapview to make quick interactive maps of the
median hhousehold income data
mapview(med_hhincome)
When passed the name of an sf object and no other
options, mapview will:
display the geometry using a single color for the fill and for the stroke
display the feature ID on hover
display the data from the dataframe
on click
The zcol argument will take a column name and color the
features by the values in that column.
mapview(med_hhincome, zcol="estimate")
In the Census-Data-in-R-Challenges.Rmd, use
mapviewto create an interactive map of median household rent.
ACS variables can be confusing.
Some ways to identify the best variables to explore:
Web search, especially Census web resources, can help.
The Census Reporter website (https://censusreporter.org) provides another tool for navigating topics, tables, and variable names.
The NHGIS website (nhgis.org) is a great way to browse variables of interest.
We haven’t talked about it but it may be important in your work with ACS data.
Math is needed to combine MOEs when you combine variables.
tidycensus includes some nice functions
for these calculations and a good overview of the topic.tidycensus offers two key functions for fetching census
tabular and geographic: get_acs and
get_decennial.
Support for fetching population estimates and
migration flow census data was recently added to
tidycensus. You can read up on it on the tidycensus
documentation website
Using tidycensus to fetch the tabular data or both
tabular and geographic data is IMO way easier than any alternatives,
IF you (1) know R, (2) know a bit about working with
geographic data in R.
This approach is also scaleable if you want multiple census variables for various locations and tabulation areas.
You can make publication or report ready maps with highly
customizable ggplot2 code or use the sf::plot
command to make quick maps.
Interactive mapping greatly enhances your ability to do exploratory data analysis in RStudio.
Much of this tutorial is based on resources by Kyle Walker, author of
tidycensus. See:
Related D-Lab Workshops
Great online resource for working with spatial data in R